In this vignette, we analyse a subset of the data from (Paul et al. 2015). A SingleCellExperiment object of the data has been provided with the tradeR package and can be retrieved with data(se). The data and UMAP reduced dimensions were derived from following the Monocle 3 vignette.
suppressPackageStartupMessages({
library(SingleCellExperiment)
library(mgcv)
library(tradeR)
library(slingshot)
library(RColorBrewer)
library(dplyr)
library(ggplot2)
library(tidyr)
})
palette(brewer.pal(8,"Dark2"))
data(se, package = "tradeR")
se
## class: SingleCellExperiment
## dim: 3004 2660
## metadata(0):
## assays(1): counts
## rownames(3004): 0610007L01Rik 0610009O20Rik ... Zzef1 rp9
## rowData names(1): gene_short_name
## colnames(2660): W31105 W31106 ... W39167 W39168
## colData names(22): Seq_batch_ID Amp_batch_ID ... no_expression
## louvain_component
## reducedDimNames(1): UMAP
## spikeNames(0):
We will fit developmental trajectories usling the slingshot package (Street et al. 2018). slingshot requires cluster labels as input, and fits trajectories in reduced dimension. We will use the reduced space calculated with the UMAP method, which is precalculated in the se object. We cluster the data using k-means with \(7\) clusters. Since we know which cells are the progenitor cell type, we define the starting point for the trajectories as input for slingshot. Note that this argument is optional, and not required to run slingshot.
set.seed(97)
rd <- reducedDims(se)$UMAP
cl <- kmeans(rd, centers = 7)$cluster
plot(rd, col = brewer.pal(9, "Set1")[cl], pch = 16, asp = 1,
cex = 2/3)
library(slingshot)
lin <- getLineages(rd, clusterLabels = cl, start.clus = 4)
## Using full covariance matrix
crv <- getCurves(lin)
plotGeneCount(rd = rd, curve = crv, counts = counts, clusters = cl)
We find two trajectories for this dataset. We can futher color the cells according to their cell type.
celltype <- factor(colData(se)$cell_type2)
plotGeneCount(rd = rd, curve = crv, counts = counts, clusters = celltype)
legend("topright", levels(celltype),
col = brewer.pal(9, "Set1")[1:nlevels(celltype)],
pch = 16, cex = 1 / 2, bty = "n")
After estimating the trajectories, we can fit generalized additive models (GAMs) with the tradeR package. Internally, this package builds on the mgcv package by fitting additive models using the gam function. The core function from tradeR, fitGAM will use cubic splines as basis functions, and it tries to ensure that every trajectory will end at a knot point of a basis function. By default, we allow for \(10\) knots for every trajectory, but this can be changed with the nknots function. By default, we estimate one smoothers for every trajectory in the GAM using the negative binomial distribution. If you want to allow for other fixed effects (e.g. batch effects), then an additional model matrix can be provided with the X argument.
We use the effective library size, estimated with TMM (M. D. Robinson and Oshlack 2010), as offset in the model. We allow for alternatives by providing a user-defined offset with the offset argument.
This dataset consists of UMI counts, and we do not expect zero inflation to be a big problem here. However, we also allow to fit zero inflated negative binomial (ZINB) GAMs by providing observation-level weights to fitGAM using the weights argument. The weights must correspond to the posterior probability that a count belongs to the zero inflation component of the ZINB distribution (Van den Berge et al. 2018).
For the vignette, we fit smoothers for a filtered set of genes in the dataset, 239. We also include the Irf8 gene, since it is a known transcription factor involved in hematopoiesis.
counts <- assays(se)$counts %>% as.matrix()
filt <- rowSums(counts > 8) > ncol(counts)/100
filt["Irf8"] <- T
counts <- counts[filt, ]
gamList <- fitGAM(counts = counts,
pseudotime = slingPseudotime(crv, na = FALSE),
cellWeights = slingCurveWeights(crv))
# This takes about 1mn to run
One may explore the results of a model by requesting its summary.
summary(gamList[[1]])
##
## Family: Negative Binomial(9.118)
## Link function: log
##
## Formula:
## y ~ -1 + s(t1, by = l1, bs = "cr", id = 1, k = nknots) + s(t2,
## by = l2, bs = "cr", id = 1, k = nknots) + offset(offset)
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(t1):l1 7.729 8.744 64854 <2e-16 ***
## s(t2):l2 6.687 7.501 78737 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.454 Deviance explained = 5.1%
## -REML = 4248.9 Scale est. = 1 n = 2660
A fist exploration of the data analysis may consist in checking whether a gene is associated with a particular lineage. The statistical test performed here is testing the null hypothesis that all smoother coefficients are equal to zero. One can check whether a gene is significantly associated with a particular lineage by extracting its p-value with the getSmootherPvalues function. This function extracts the p-values calculated by mgcv from the GAM, and will return NA for genes that we were unable to fit properly. Similarly, the test statistics may be extracted with getSmootherTestStats. Since this dataset was prefiltered to only contain relevant genes, all p-values (test statistics) will be very low (high).
pvalLineage <- getSmootherPvalues(gamList)
statLineage <- getSmootherTestStats(gamList)
In order to discover marker genes of the progenitor cell population, researchers may be interested in assessing differential expression between the progenitor cell population (i.e., the starting point of a trajectory) with the differentiated cell type population. In the function startPointTest, we have implemented a Wald test that tests the null hypothesis that the expression at the starting point of the smoother (progenitor population) is identical to the expression at the end point of the smoother (differentiated population). The test basically involves a comparison between two smoother coefficients for every lineage. The function startPointTest performs an omnibus across all trajectories by default, but you can also assess all pairwise comparisons separately by setting pairwise=TRUE. Below, we adopt an omnibus test across the two lineages.
startRes <- startPointTest(gamList)
We can visualize the estimated smoothers for the most significant gene.
oStart <- order(startRes$waldStat, decreasing = TRUE)
sigGeneStart <- names(gamList)[oStart[1]]
plotSmoothers(gamList[[sigGeneStart]])
Alternatively, we can color the cells in UMAP space with that gene’s expression.
plotGeneCount(rd, crv, counts, gene = sigGeneStart)
tradeR can discover marker genes for the differentiated cell types by comparing the end points of the lineage-specific smoothers. This is implemented in the endPointTest function. By default, endPointTest performs an omnibus test, testing the null hypothesis that the endpoint expression is equal for all trajectories using a multivariate Wald test. If more than two trajectories are present, one can assess all pairwise comparisons using the pairwise=TRUE argument.
endRes <- endPointTest(gamList)
We can plot the most significant gene using the plotSmoothers function.
o <- order(endRes$waldStat, decreasing = TRUE)
sigGene <- names(gamList)[o[2]]
plotSmoothers(gamList[[sigGene]])
Alternatively, we can color the cells in UMAP space with that gene’s expression.
plotGeneCount(rd, crv, counts, gene = sigGene)
Asides from testing at the level of the differentiated cell type level, researchers may be interested in assessing the expression pattern of a gene over time. The function patternTest implements a statistical method that tests the null hypothesis of whether all smoother coefficients are equal to each other. Due to our construction of the smoothers, which have identical smoothing penalties, knots and basis functions, genes with similar expression patterns will have similar smoothing coefficients. Hence, if all coefficients are similar, the expression patterns will be similar.
patternRes <- patternTest(gamList)
oPat <- order(patternRes$waldStat, decreasing = TRUE)
head(rownames(patternRes)[oPat])
## [1] "Mpo" "Prtn3" "Car2" "Ctsg" "Elane" "H2afy"
plotSmoothers(gamList[[rownames(patternRes)[oPat][1]]])
plotGeneCount(rd, crv, counts, gene = rownames(patternRes)[oPat][1])
We find genes at the top that are also ranked as DE for the differentiated cell type. What is especially interesting are genes which rank high for pattern differentiation but low for final stage differentation. We therefore sort the genes according to the sum of square of their rank in increasing Wald statistics for the patternTest and their rank in decreasing Wald statistics for the endPointTest.
compare <- inner_join(patternRes %>% mutate(Gene = rownames(patternRes),
pattern = waldStat) %>%
select(Gene, pattern),
endRes %>% mutate(Gene = rownames(endRes),
end = waldStat) %>%
select(Gene, end),
by = c("Gene" = "Gene")) %>%
mutate(transientScore = (min_rank(desc(end)))^2 +
(min_rank(pattern) - 1)^2)
ggplot(compare, aes(x = log(pattern), y = log(end))) +
geom_point(aes(col = transientScore)) +
labs(x = "endPointTest Wald Statistic (log scale)",
y = "patternTest Wald Statistic (log scale)") +
scale_color_continuous(low = "yellow", high = "red") +
theme_classic()
Or, we can visualize the expression in UMAP space of the top gene.
topTransient <- (compare %>% arrange(desc(transientScore)))[1, "Gene"]
plotSmoothers(gamList[[topTransient]])
plotGeneCount(rd, crv, counts, gene = topTransient)
Interestingly, we recover the Irf8 gene in the top 5 genes according to that rankin
head(compare %>% arrange(desc(transientScore)) %>% select(Gene), n = 5)
## Gene
## 1 Nedd4
## 2 Mki67
## 3 Mt1
## 4 Irf8
## 5 Hint1
We can also plot the Irf8 gene.
plotSmoothers(gamList[["Irf8"]])
plotGeneCount(rd, crv, counts, gene = "Irf8")
Another question of interest is to find a list of genes that are differentially expressed at the beginning of the separation of the two lineages.
The function earlyDrivers implements a statistical method to tests the null hypothesis of whether the first smoother coefficients are equal to each other.
earlyGenes <- earlyDrivers(gamList, output = "wald")
Similar to above, we can plot the estimated functions for the most significant gene, where there is a difference in the second coefficient.
oPat <- order(earlyGenes[,2], decreasing = TRUE)
head(rownames(earlyGenes)[oPat])
## [1] "Car1" "Ube2c" "Srgn" "Pkm2" "Coro1a" "Ctsg"
plotSmoothers(gamList[[rownames(earlyGenes)[oPat][1]]])
Or, we can visualize its expression in UMAP space.
plotGeneCount(rd, crv, counts, gene = rownames(patternRes)[oPat][1])
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 tidyr_0.8.1
## [3] ggplot2_3.0.0 dplyr_0.7.6
## [5] RColorBrewer_1.1-2 slingshot_0.99.13
## [7] bigmemory_4.5.33 princurve_2.1.2.1
## [9] tradeR_0.1.0 mgcv_1.8-24
## [11] nlme_3.1-137 SingleCellExperiment_1.3.10
## [13] SummarizedExperiment_1.11.6 DelayedArray_0.7.42
## [15] BiocParallel_1.15.12 matrixStats_0.54.0
## [17] Biobase_2.41.2 GenomicRanges_1.33.14
## [19] GenomeInfoDb_1.17.1 IRanges_2.15.18
## [21] S4Vectors_0.19.19 BiocGenerics_0.27.1
##
## loaded via a namespace (and not attached):
## [1] uuid_0.1-2 backports_1.1.2
## [3] copula_0.999-18 NMF_0.21.0
## [5] igraph_1.2.2 plyr_1.8.4
## [7] lazyeval_0.2.1 splines_3.5.0
## [9] pspline_1.0-18 crosstalk_1.0.0
## [11] rncl_0.8.3 gridBase_0.4-7
## [13] digest_0.6.16 foreach_1.4.4
## [15] htmltools_0.3.6 viridis_0.5.1
## [17] magrittr_1.5 memoise_1.1.0
## [19] cluster_2.0.7-1 doParallel_1.0.14
## [21] limma_3.37.7 annotate_1.59.1
## [23] stabledist_0.7-1 prettyunits_1.0.2
## [25] colorspace_1.3-2 blob_1.1.1
## [27] jsonlite_1.5 crayon_1.3.4
## [29] RCurl_1.95-4.11 bigmemory.sri_0.1.3
## [31] genefilter_1.63.2 bindr_0.1.1
## [33] phylobase_0.8.4 survival_2.42-6
## [35] iterators_1.0.10 ape_5.1
## [37] glue_1.3.0 registry_0.5
## [39] gtable_0.2.0 zlibbioc_1.27.0
## [41] XVector_0.21.3 webshot_0.5.0
## [43] kernlab_0.9-27 Rhdf5lib_1.3.3
## [45] prabclus_2.2-6 DEoptimR_1.0-8
## [47] HDF5Array_1.9.19 scales_1.0.0
## [49] mvtnorm_1.0-8 DBI_1.0.0
## [51] edgeR_3.23.4 rngtools_1.3.1
## [53] bibtex_0.4.2 miniUI_0.1.1.1
## [55] Rcpp_0.12.18 viridisLite_0.3.0
## [57] xtable_1.8-3 progress_1.2.0
## [59] bit_1.1-14 mclust_5.4.1
## [61] glmnet_2.0-16 htmlwidgets_1.2
## [63] httr_1.3.1 fpc_2.1-11.1
## [65] modeltools_0.2-22 pkgconfig_2.0.2
## [67] XML_3.98-1.16 flexmix_2.3-14
## [69] nnet_7.3-12 locfit_1.5-9.1
## [71] labeling_0.3 manipulateWidget_0.10.0
## [73] later_0.7.4 howmany_0.3-1
## [75] tidyselect_0.2.4 rlang_0.2.2
## [77] softImpute_1.4 reshape2_1.4.3
## [79] AnnotationDbi_1.43.1 munsell_0.5.0
## [81] tools_3.5.0 RSQLite_2.1.1
## [83] ade4_1.7-13 evaluate_0.11
## [85] stringr_1.3.1 yaml_2.2.0
## [87] knitr_1.20 bit64_0.9-7
## [89] robustbase_0.93-2 rgl_0.99.16
## [91] purrr_0.2.5 dendextend_1.8.0
## [93] mime_0.5 whisker_0.3-2
## [95] xml2_1.2.0 compiler_3.5.0
## [97] tibble_1.4.2 RNeXML_2.1.2
## [99] pcaPP_1.9-73 stringi_1.2.4
## [101] gsl_1.9-10.3 RSpectra_0.13-1
## [103] lattice_0.20-35 trimcluster_0.1-2.1
## [105] Matrix_1.2-14 pillar_1.3.0
## [107] ADGofTest_0.3 zinbwave_1.3.4
## [109] data.table_1.11.4 bitops_1.0-6
## [111] httpuv_1.4.5 R6_2.2.2
## [113] promises_1.0.1 gridExtra_2.3
## [115] codetools_0.2-15 MASS_7.3-50
## [117] assertthat_0.2.0 rhdf5_2.25.11
## [119] pkgmaker_0.27 rprojroot_1.3-2
## [121] withr_2.1.2 GenomeInfoDbData_1.1.0
## [123] locfdr_1.1-8 diptest_0.75-7
## [125] hms_0.4.2 grid_3.5.0
## [127] class_7.3-14 rmarkdown_1.10
## [129] shiny_1.1.0 numDeriv_2016.8-1
## [131] clusterExperiment_2.1.6 lubridate_1.7.4
Paul, Franziska, Ya’ara Arkin, Amir Giladi, Diego Adhemar Jaitin, Ephraim Kenigsberg, Hadas Keren-Shaul, Deborah Winter, et al. 2015. “Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors.” Cell 163 (7). Cell Press: 1663–77. doi:10.1016/J.CELL.2015.11.013.
Robinson, Mark D, and Alicia Oshlack. 2010. “A scaling normalization method for differential expression analysis of RNA-seq data.” Genome Biology 11 (3). BioMed Central: R25. doi:10.1186/gb-2010-11-3-r25.
Street, Kelly, Davide Risso, Russell B. Fletcher, Diya Das, John Ngai, Nir Yosef, Elizabeth Purdom, and Sandrine Dudoit. 2018. “Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics.” BMC Genomics 19 (1). BioMed Central: 477. doi:10.1186/s12864-018-4772-0.
Van den Berge, Koen, Fanny Perraudeau, Charlotte Soneson, Michael I. Love, Davide Risso, Jean-Philippe Vert, Mark D. Robinson, Sandrine Dudoit, and Lieven Clement. 2018. “Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications.” Genome Biology 19 (1). BioMed Central: 24. doi:10.1186/s13059-018-1406-4.